
Add blob direct write with partitioned blob files #14457

Draft
xingbowang wants to merge 15 commits into facebook:main from xingbowang:2026_03_04_blob_memtable_partition

xingbowang commented Mar 12, 2026

Summary

Add a new blob direct write feature with partitioned blob files. For large values, Put() writes the blob value directly to a blob file, bypassing both the WAL and the memtable. Only the small (~30-byte) BlobIndex pointer is stored in the WAL and memtable. This reduces WAL write amplification, memtable memory usage, and blob write lock contention for large-value workloads.

Motivation

With standard blob separation, full blob values are first written to WAL, then stored in the memtable, and only separated into blob files during flush. For workloads with large values (e.g., 4KB–1MB), this means the WAL and memtable carry the full value payload even though it will eventually be stored separately. This wastes WAL bandwidth, inflates memtable memory, and adds unnecessary write amplification.

Additionally, the existing blob file write path uses a single blob file writer per column family, which becomes a serialization bottleneck under concurrent write workloads. Partitioned blob files address this by spreading writes across multiple independent blob files, each with its own lock, enabling true parallel blob I/O from multiple writer threads.

Design

Write Path

  • DBImpl::Put() fast path: For single-key puts where the value exceeds min_blob_size, the blob is written directly to a blob file and a BlobIndex-only WriteBatch is constructed, avoiding full value serialization entirely.
  • DBImpl::WriteImpl() batch path: For multi-key WriteBatch operations, a BlobWriteBatchTransformer iterates the batch, writes qualifying values to blob files, and replaces them with BlobIndex entries before the batch enters WAL/memtable.
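The batch path above can be illustrated with a toy model. This is a minimal sketch that uses only the standard library, not the PR's code: ToyBlobFile, BlobPointer, BatchEntry, and TransformBatch are hypothetical names, and the "blob file" is an in-memory string. The `>=` comparison follows the documented meaning of min_blob_size for existing blob files. That the direct-write path uses the same comparison is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a BlobIndex: where the value lives.
struct BlobPointer {
  uint64_t file_number;
  uint64_t offset;
  uint64_t size;
};

// Hypothetical stand-in for one entry of a transformed WriteBatch.
struct BatchEntry {
  std::string key;
  std::string inline_value;  // set when is_blob_index is false
  bool is_blob_index = false;
  BlobPointer pointer{};     // set when is_blob_index is true
};

// Toy append-only blob file held in memory.
class ToyBlobFile {
 public:
  explicit ToyBlobFile(uint64_t file_number) : file_number_(file_number) {}

  BlobPointer Append(const std::string& value) {
    BlobPointer p{file_number_, static_cast<uint64_t>(data_.size()),
                  static_cast<uint64_t>(value.size())};
    data_.append(value);
    return p;
  }

  std::string Read(const BlobPointer& p) const {
    return data_.substr(p.offset, p.size);
  }

 private:
  uint64_t file_number_;
  std::string data_;
};

// Writes qualifying values to the blob file and replaces them with pointers.
// Small values stay inline, as they would in the WAL and memtable.
std::vector<BatchEntry> TransformBatch(
    const std::vector<std::pair<std::string, std::string>>& puts,
    std::size_t min_blob_size, ToyBlobFile& blob_file) {
  std::vector<BatchEntry> out;
  out.reserve(puts.size());
  for (const auto& kv : puts) {
    BatchEntry e;
    e.key = kv.first;
    if (kv.second.size() >= min_blob_size) {  // assumption: inclusive threshold
      e.is_blob_index = true;
      e.pointer = blob_file.Append(kv.second);
    } else {
      e.inline_value = kv.second;
    }
    out.push_back(std::move(e));
  }
  return out;
}
```

In this model the transformed batch carries only keys, small inline values, and fixed-size pointers, which is the property that keeps large payloads out of the WAL and memtable.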

BlobFilePartitionManager

A new BlobFilePartitionManager manages partitioned blob files for concurrent writes:

  • Partitioned writes: Multiple blob file partitions (configurable via blob_direct_write_partitions) each with their own mutex, reducing lock contention for concurrent writers.
  • Deferred flush mode (blob_direct_write_buffer_size > 0): Zero-copy buffering where Slice references point directly into the WriteBatch buffer. Background threads flush to disk in batches, amortizing syscall overhead. Includes backpressure with stall watermarks.
  • Sync mode (blob_direct_write_buffer_size = 0): Immediate write-through for maximum durability.
  • Pluggable partition strategy: Custom BlobFilePartitionStrategy interface for key/value-aware partition assignment (default: round-robin).
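The partitioning idea can be sketched as follows: round-robin assignment with one mutex per partition. ToyPartitionedWriter is a hypothetical class, not the PR's BlobFilePartitionManager, and it only counts records and bytes instead of doing file I/O.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical toy: spreads writes over N partitions, each with its own lock.
class ToyPartitionedWriter {
 public:
  explicit ToyPartitionedWriter(uint32_t num_partitions) {
    assert(num_partitions > 0);
    for (uint32_t i = 0; i < num_partitions; ++i) {
      partitions_.push_back(std::make_unique<Partition>());
    }
  }

  // Picks the next partition round-robin and records the write under that
  // partition's mutex only. Returns the index of the chosen partition.
  uint32_t Write(const std::string& value) {
    const uint64_t ticket = next_.fetch_add(1, std::memory_order_relaxed);
    const uint32_t idx = static_cast<uint32_t>(ticket % partitions_.size());
    Partition& p = *partitions_[idx];
    std::lock_guard<std::mutex> lock(p.mu);
    p.bytes += value.size();
    ++p.records;
    return idx;
  }

  uint64_t Records(uint32_t idx) const {
    Partition& p = *partitions_[idx];
    std::lock_guard<std::mutex> lock(p.mu);
    return p.records;
  }

 private:
  struct Partition {
    std::mutex mu;
    uint64_t bytes = 0;
    uint64_t records = 0;
  };

  std::vector<std::unique_ptr<Partition>> partitions_;
  std::atomic<uint64_t> next_{0};
};
```

Because the round-robin counter is atomic and each partition has its own mutex, two threads block each other only when they land on the same partition. A key- or value-aware strategy would replace the `ticket % size` step.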

Flush Integration

  • On memtable flush, BlobFilePartitionManager::SealAllPartitions() finalizes open blob files and injects BlobFileAddition entries into the flush VersionEdit, so blob files are registered in the MANIFEST atomically with the flush SST.
  • Handles mempurge: if a flush is switched to mempurge, sealed blob file additions are returned to the partition manager for the next flush.
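The seal-and-return bookkeeping described above might look like this toy model. ToySealer and ToyBlobFileAddition are hypothetical names. How the PR actually stores returned additions is not stated in the description, so the carry-over list here is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a BlobFileAddition entry.
struct ToyBlobFileAddition {
  uint64_t file_number;
  uint64_t total_bytes;
};

class ToySealer {
 public:
  void AddOpenFile(uint64_t file_number, uint64_t bytes) {
    open_.push_back({file_number, bytes});
  }

  // Seals all open files. The result also includes additions carried over
  // from an earlier flush that was switched to mempurge.
  std::vector<ToyBlobFileAddition> SealAll() {
    std::vector<ToyBlobFileAddition> out;
    out.swap(carried_over_);
    out.insert(out.end(), open_.begin(), open_.end());
    open_.clear();
    return out;
  }

  // Called when the flush did not write a VersionEdit (for example,
  // mempurge), so the additions must be registered by the next flush.
  void ReturnUnconsumed(const std::vector<ToyBlobFileAddition>& additions) {
    carried_over_.insert(carried_over_.end(), additions.begin(),
                         additions.end());
  }

 private:
  std::vector<ToyBlobFileAddition> open_;
  std::vector<ToyBlobFileAddition> carried_over_;
};
```

The point of the carry-over list is that a sealed blob file is never dropped: it is either consumed by a flush that reaches the MANIFEST or handed back for the next one.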

Crash Recovery

  • Orphan blob file recovery in DBImpl::Open(): Scans for blob files not registered in the MANIFEST (e.g., from crashes before flush), reads their headers to determine column family, validates records, and registers them via VersionEdit. Runs regardless of current enable_blob_direct_write setting to handle DBs previously opened with the feature.
  • WAL replay produces BlobIndex entries pointing to these recovered blob files, ensuring no data loss.

Read Path

  • DBIter and ArenaWrappedDBIter extended to resolve BlobIndex entries from direct-write blob files.
  • Deferred flush mode includes a 4-tier read fallback: pending records → in-flight records → BlobFileCache → blob file read.
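The tiered lookup order can be sketched with four in-memory maps. This is a simplified illustration, not the PR's read path: ToyDeferredReader is hypothetical, and the cache tier is modeled as a plain value map standing in for BlobFileCache.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

using BlobAddress = std::pair<uint64_t, uint64_t>;  // (file number, offset)
using Tier = std::map<BlobAddress, std::string>;

enum class ReadSource { kPending, kInFlight, kCache, kFile, kNotFound };

struct ReadResult {
  ReadSource source;
  std::string value;
};

// Hypothetical toy reader that checks the tiers in the order listed above.
struct ToyDeferredReader {
  Tier pending;    // buffered records not yet picked up by a background flush
  Tier in_flight;  // records currently being written by a background flush
  Tier cache;      // stand-in for the BlobFileCache tier
  Tier file;       // stand-in for a read from the blob file on disk

  ReadResult Get(const BlobAddress& addr) const {
    struct TierRef {
      const Tier* tier;
      ReadSource source;
    };
    const TierRef order[] = {{&pending, ReadSource::kPending},
                             {&in_flight, ReadSource::kInFlight},
                             {&cache, ReadSource::kCache},
                             {&file, ReadSource::kFile}};
    for (const TierRef& t : order) {
      Tier::const_iterator it = t.tier->find(addr);
      if (it != t.tier->end()) {
        return ReadResult{t.source, it->second};
      }
    }
    return ReadResult{ReadSource::kNotFound, std::string()};
  }
};
```

Checking the in-memory tiers first is what lets a reader see a value whose BlobIndex is already visible in the memtable but whose bytes have not yet reached disk.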

New Options

  • enable_blob_direct_write (bool, default: false) — master switch
  • blob_direct_write_partitions (uint32, default: 1) — number of concurrent blob file partitions
  • blob_direct_write_buffer_size (uint64, default: 4MB) — per-partition write buffer; 0 = sync mode
  • blob_direct_write_use_direct_io (bool, default: false) — O_DIRECT for blob writes
  • blob_direct_write_flush_interval_ms (uint64, default: 0) — periodic background flush interval
  • blob_direct_write_partition_strategy (shared_ptr, default: round-robin)
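To show how the options interact, here is a standalone struct that mirrors the names and defaults listed above, with a helper that resolves the write mode. DirectWriteOptionsSketch, WriteMode, and ResolveMode are hypothetical. The struct is not the real RocksDB options type, the strategy option is omitted, and "4MB" is assumed to mean 4 * 1024 * 1024 bytes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the new options, using the defaults from the list.
struct DirectWriteOptionsSketch {
  bool enable_blob_direct_write = false;
  uint32_t blob_direct_write_partitions = 1;
  uint64_t blob_direct_write_buffer_size = 4ull << 20;  // assumption: 4 MiB
  bool blob_direct_write_use_direct_io = false;
  uint64_t blob_direct_write_flush_interval_ms = 0;
};

enum class WriteMode { kDisabled, kSync, kDeferred };

// Buffer size 0 selects sync mode; any positive size selects deferred flush.
WriteMode ResolveMode(const DirectWriteOptionsSketch& o) {
  if (!o.enable_blob_direct_write) {
    return WriteMode::kDisabled;
  }
  return o.blob_direct_write_buffer_size == 0 ? WriteMode::kSync
                                              : WriteMode::kDeferred;
}
```

With the listed defaults the feature is off. Enabling only the master switch yields deferred flush mode, because the default buffer size is nonzero.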

New Statistics

  • BLOB_DB_DIRECT_WRITE_COUNT — number of blobs written via direct write
  • BLOB_DB_DIRECT_WRITE_BYTES — bytes written via direct write
  • BLOB_DB_DIRECT_WRITE_STALL_COUNT — writer stalls due to backpressure
  • BLOB_DB_COMPRESSION_MICROS — blob compression timing

Testing

  • 61 new tests in db_blob_direct_write_test.cc covering: basic put/get, multi-get, concurrent writers, compression (with Snappy availability checks), crash recovery, orphan recovery, WAL recovery, snapshot isolation, transactions (including 2PC), backpressure, multiple column families, file rotation, statistics, event listeners, file checksums, direct I/O, sync/deferred flush modes, and error injection.
  • db_stress and db_crashtest.py integration for continuous randomized testing.
  • All existing blob tests updated to coexist with the new code paths.
  • Full make check passes (39,454 tests, 0 failures).

New Files

  • db/blob/blob_file_partition_manager.cc/.h — core partition manager (~1,700 lines)
  • db/blob/blob_write_batch_transformer.cc/.h — WriteBatch transformation logic
  • db/blob/db_blob_direct_write_test.cc — comprehensive test suite (~2,000 lines)
  • db/blob/blob_file_completion_callback.cc — SstFileManager and EventListener integration
